System and method for detecting the recognizability of input speech signals

ABSTRACT

A system and method for detecting the recognizability of an input speech signal are provided. The system is placed in the pre-stage of a speech recognition or dialog system. It detects the user's environmental conditions and verifies whether the input speech signal can be recognized. It mainly comprises an environment parameter generator, a signal recognition verifier, and a strategy response processor. Used in the pre-stage of a speech recognition or dialog system, the invention can precisely verify the recognizability of the input speech signal and accept the input speech signals of high recognition probability in a noisy environment. This reduces the impact of receiving input speech signals of low recognition probability and thus increases the recognition probability for a recognizer.

FIELD OF THE INVENTION

The present invention generally relates to the field of speech recognition, and more specifically to a system and method for detecting the recognizability of input speech signals.

BACKGROUND OF THE INVENTION

Speech recognition systems usually encounter various problems caused by the environment, such as background noise and channel effects, or by speaker factors, such as accent and speaking rate, so that the input speech is beyond the recognition capability of the system. Prior research has proposed various improvements to the recognition capability, but with only limited results.

U.S. Pat. No. 6,272,460, “Method for Implementing a Speech Verification System for Use in a Noisy Environment”, disclosed a system including a speech verifier in the front stage of the system. As shown in FIG. 1, a speech verifier 100 includes a noise suppressor 110, a pitch detector 120, and a confidence determiner 130. The object is to remove noise and obtain the pitch. The pitch value is translated into a time-variant confidence index for determining whether the input signal at a certain time is speech. The confidence index is transferred to the recognizer to assist the recognition.

U.S. Pat. No. 6,272,461 emphasized speech detection and assistance in speech recognition for all input signals, regardless of whether the input signals are beyond the acceptable range.

Current speech recognition and dialog systems do not have the capability to sense the usage environment. This implies that the system will blindly try to recognize the speech and generate an output no matter how harsh the usage environment is and no matter how far the task is beyond the system capability. As a result, the user may receive an erroneous answer. This not only wastes system resources, but also leads to potentially severe outcomes.

Take the auto-attendant as an example. When a caller uses the extension number inquiry system from a noisy subway station or a busy street, the environmental noise lowers the signal-to-noise ratio (SNR) until it is beyond the system capability. The system will nevertheless perform the speech recognition process and generate a wrong extension number. In the end, the caller will need to ask a customer service representative for assistance. This scenario shows both the waste of system resources and the failure to save manpower.

On the other hand, if the system can determine whether the input signal is within the recognizable range before the system starts the actual recognition process, the recognizable signals can be passed for recognition while the unrecognizable signals can be handled with appropriate actions. In this manner, the possibility of successful speech recognition will increase.

SUMMARY OF THE INVENTION

The present invention has been made to overcome the above-mentioned drawback of conventional speech recognition systems that have no capability of sensing the usage environment. The primary object of the present invention is to provide a system and a method for detecting the recognizability of the input speech signals.

In comparison with the conventional methods, the present invention includes the following characteristics: (a) The present invention emphasizes the front stage of the recognition system. By using a small amount of system resources to detect whether the input signal can be successfully recognized, the efficiency of the system can be improved. (b) The recognizable signals are passed to the recognizer for recognition, and unrecognizable signals are handled with appropriate actions. (c) The unrecognizable signals are not passed to the system for recognition, so that system resources are saved.

To achieve the above object, the present invention provides a system with a front stage for detecting the recognizability of input speech signals, comprising an environment parameter generator, a signal recognition verifier, and a strategy response processor.

The system operates as follows. First, the environment parameter generator generates a plurality of parameters in accordance with the environment to represent the environment conditions or the input signal quality. Then, the signal recognition verifier, after the initial training, verifies whether the input signal is recognizable in accordance with the environment parameters. When the input signal is verified as recognizable, the input signal is passed to the recognition device for recognition. On the other hand, when the input signal is verified as unrecognizable, the strategy response processor is triggered to propose a strategy to respond to the environment or signal quality of the user in accordance with the environment parameters.

In the embodiment of the present invention, the environment parameter generator selects the SNR of the input signal, the probability of the input signal being speech, and the confidence index of the system processing the input signal as the environment parameters. The strategy response processor proposes different strategies to guide the user to improve the input. For example, when the SNR is low, the user is advised to raise the voice or move to a quieter environment; when the confidence index is low, the user is advised to speak more clearly. Then, the user is prompted to input the signal again or is transferred to a customer service representative.

The foregoing and other objects, features, aspects and advantages of the present invention will become better understood from a careful reading of a detailed description provided herein below with appropriate reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a schematic view of a conventional speech recognition system and method in a noisy environment.

FIG. 2 shows a schematic view of a block diagram of the present invention.

FIG. 3 shows a schematic view of the block diagram of the environment parameter generator of the present invention.

FIG. 4 shows a schematic view of the block diagram of the signal recognition verifier of the present invention.

FIG. 5 shows an embodiment of the strategy response processor of the present invention.

FIG. 6 shows the experimental results of the recognition rate for a simulated noise environment with six test sets.

FIG. 7 shows the output of the error of the failure and the success of recognition for the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

As mentioned earlier, the speech recognition system for detecting the recognizability of the input speech emphasizes the front stage of the recognition or dialog system. FIG. 2 shows a schematic view of a block diagram of the present invention. As shown in FIG. 2, a speech recognition system 200 comprises an environment parameter generator 210, a signal recognition verifier 220 and a strategy response processor 230. The functionality of each component and the system operation are described as follows.

First, environment parameter generator 210 generates at least one environment parameter for the input signal. The environment parameter represents the environment conditions for the input signal or the input signal quality. Without loss of generality, the embodiment of the present invention uses the SNR of the input signal, the probability of the input signal being speech, and the confidence index of the system processing the input signal as the environment parameters. These environment parameters can be generated by using voice activity detection (VAD) and missing feature imputation (MFI) to obtain a clean speech signal, on which a calculation is then performed. The calculation of the environment parameters will be described later.

Then, signal recognition verifier 220, which has been trained in advance with environment parameters, verifies whether the input signal is recognizable in accordance with the environment parameters. When the input signal is verified as recognizable, the input signal is passed to a recognition device 225 for further recognition. When the input signal is verified as unrecognizable, strategy response processor 230 is triggered to respond with a plurality of strategies to increase the possibility of successful recognition.
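The gating flow described above can be sketched in Python as follows. This is only an illustration: the function name `front_stage` and the callable parameters are hypothetical, since the specification does not define a programming interface, and the stand-in components in the usage lines are placeholders rather than real detectors or recognizers.

```python
from typing import Callable, Dict, Sequence

# Hypothetical type alias: a named set of environment parameters.
EnvParams = Dict[str, float]


def front_stage(
    signal: Sequence[float],
    generate_params: Callable[[Sequence[float]], EnvParams],
    is_recognizable: Callable[[EnvParams], bool],
    recognize: Callable[[Sequence[float]], str],
    respond: Callable[[EnvParams], str],
) -> str:
    """Gate a signal: recognize it, or respond with a strategy instead."""
    params = generate_params(signal)   # environment parameter generator 210
    if is_recognizable(params):        # signal recognition verifier 220
        return recognize(signal)       # recognition device 225
    return respond(params)             # strategy response processor 230


if __name__ == "__main__":
    # Stand-in components, for illustration only.
    result = front_stage(
        [0.1, 0.2],
        generate_params=lambda s: {"snr": 2.0},
        is_recognizable=lambda p: p["snr"] >= 5.0,
        recognize=lambda s: "extension 1234",
        respond=lambda p: "Please move to a quieter location and try again.",
    )
    print(result)
```

Passing the components as callables keeps the sketch independent of how each block is actually realized.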

FIG. 3 shows a block diagram of the environment parameter generator of the present invention. The environment parameter generator includes an SNR calculator 310a, a probability calculator 310b for calculating the probability of the signal being speech, and a confidence calculator 310c for calculating the confidence index of the system processing the input signal. The calculation of each calculator is described as follows.

In an actual application environment, the background noise usually directly affects the recognition rate of the system. Therefore, the present invention uses the SNR as the first environment parameter.

First, SNR calculator 310a uses the VAD method to detect the speech part x and the non-speech part (noise) $u_n$ from the spectrum feature of the input signal y. Then, the MFI method is used to remove the noise from the speech part x to obtain a clean speech signal $\hat{x}$. Based on the noise $u_n$ and the clean signal $\hat{x}$, the SNR of the input signal y, named $\mathrm{SNR}_y$, is calculated. In general, the higher the SNR is, the higher the probability that a signal can be recognized successfully. $\mathrm{SNR}_y$ can be expressed by the following equations:

$$\mathrm{SNR}(t) = \frac{\frac{1}{D} \sum_{d=0}^{D-1} \hat{x}(t,d)}{\frac{1}{D} \sum_{d=0}^{D-1} u_n(d)}, \quad t = 0 \sim T-1$$

$$\mathrm{SNR}_y = \max_t \left( \mathrm{SNR}(t) \right)$$

where SNR(t) is the SNR of the signal y at time t, and T is the total length of the input signal. D is the total number of frequency bands of the input signal frequency spectrum. $\hat{x}(t,d)$ is the clean speech spectrum feature parameter calculated by the MFI method at time t and band d. $u_n(d)$ is the average of the noise spectrum feature parameter calculated by the MFI method at band d. $\mathrm{SNR}_y$ is the SNR value of the input signal y.
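As a minimal sketch of this calculation, the Python below assumes that the clean speech features and the per-band noise averages have already been produced by the VAD and MFI steps, which are not implemented here. The function names and the list-of-lists data layout (one inner list of D band values per frame) are illustrative choices, not part of the specification.

```python
from typing import List, Sequence


def snr_per_frame(x_hat: Sequence[Sequence[float]], u_n: Sequence[float]) -> List[float]:
    """SNR(t): band-averaged clean-speech feature over band-averaged noise feature."""
    num_bands = len(u_n)
    noise_mean = sum(u_n) / num_bands  # assumed non-zero
    return [(sum(frame) / num_bands) / noise_mean for frame in x_hat]


def snr_of_signal(x_hat: Sequence[Sequence[float]], u_n: Sequence[float]) -> float:
    """SNR_y: the maximum of SNR(t) over all frames t = 0 .. T-1."""
    return max(snr_per_frame(x_hat, u_n))


if __name__ == "__main__":
    x_hat = [[2.0, 4.0], [6.0, 10.0]]  # T = 2 frames, D = 2 bands (made-up values)
    u_n = [1.0, 3.0]                   # per-band noise averages (made-up values)
    print(snr_per_frame(x_hat, u_n))   # [1.5, 4.0]
    print(snr_of_signal(x_hat, u_n))   # 4.0
```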

In addition to the SNR, the present invention also uses the probability, $P_y$, of the input signal y being speech as the second environment parameter. The larger the probability $P_y$ is, the more easily the input signal can be recognized successfully.

First, probability calculator 310b uses the MFI method to calculate the probability that the SNR is greater than 0 when the clean signal spectrum parameter x is at time t and band d:

$$P\left( \mathrm{SNR}(t,d) > 0 \right) = \int_{-\infty}^{y(t,d)/2} \frac{1}{\sqrt{2\pi}\, \hat{\sigma}_n(d)} \exp\left( -\frac{\left( \omega - \hat{\mu}_n(d) \right)^2}{2 \hat{\sigma}_n^2(d)} \right) d\omega$$

where $\hat{\mu}_n(d)$ and $\hat{\sigma}_n^2(d)$ are the average and the variance of the noise spectrum distribution calculated by the MFI method, respectively, and ω is the value of the noise.

Then, the MFI method is used to calculate the probability that the clean signal spectrum is speech at time t, as follows:

$$R(t) = \frac{1}{D} \sum_{d=0}^{D-1} P\left( \mathrm{SNR}(t,d) > 0 \right), \quad t = 0 \sim T-1$$

where D is the number of bands of the signal spectrum and T is the total length of the input signal.

Finally, the probability of the input signal y being speech is calculated as follows:

$$P_y = \frac{1}{T} \sum_{t=0}^{T-1} R(t)$$
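The three steps above can be sketched in Python as follows. The sketch rests on two assumptions that the text does not spell out: that the upper limit of the Gaussian integral is half the observed spectral value of the input signal at time t and band d, and that the noise mean and standard deviation per band are already available from the MFI step. All function and variable names are illustrative.

```python
import math
from typing import List, Sequence


def gaussian_cdf(value: float, mean: float, std: float) -> float:
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((value - mean) / (std * math.sqrt(2.0))))


def frame_speech_probabilities(
    y: Sequence[Sequence[float]],
    mu_n: Sequence[float],
    sigma_n: Sequence[float],
) -> List[float]:
    """R(t): average over bands of P(SNR(t, d) > 0)."""
    num_bands = len(mu_n)
    result = []
    for frame in y:
        # Assumption: the integral runs up to half the observed value y(t, d).
        probs = [gaussian_cdf(frame[d] / 2.0, mu_n[d], sigma_n[d]) for d in range(num_bands)]
        result.append(sum(probs) / num_bands)
    return result


def speech_probability(
    y: Sequence[Sequence[float]],
    mu_n: Sequence[float],
    sigma_n: Sequence[float],
) -> float:
    """P_y: average of R(t) over all frames."""
    r = frame_speech_probabilities(y, mu_n, sigma_n)
    return sum(r) / len(r)


if __name__ == "__main__":
    # Half of each observed value equals the noise mean, so each band gives 0.5.
    print(speech_probability([[2.0, 2.0]], [1.0, 1.0], [1.0, 1.0]))  # 0.5
```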

The present invention uses the confidence index of the system processing the signal as the third environment parameter. The larger the confidence index is, the more easily the input signal can be recognized successfully.

First, confidence calculator 310c measures the divergence between the input signal y and the known system model distribution x on the frequency spectrum, as expressed in the following equation:

$$D(y \| x) = \int \left[ p(y) - p(x) \right] \log\left( \frac{p(y)}{p(x)} \right) dx$$

where p(y) is the probability distribution of the spectrum parameter of the signal y, and p(x) is the probability distribution of the spectrum parameter of the system model. The larger the divergence D(y∥x) is, the lower the probability that the input signal can be recognized successfully.

Then, the divergence D(y∥x) is transformed by a sigmoid function into a confidence index $R_y$ between 0 and 1:

$$R_y = \frac{1}{1 + \exp\left( \alpha - \left( D + \beta \right) \right)}$$

where α and β are the fine-tuning parameters for enlargement and shift, respectively.
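A minimal Python sketch of these two steps is given below. It assumes that both distributions are available as discrete histograms over the same bins with strictly positive entries, so the integral becomes a sum; the specification does not say how the distributions are estimated. The sigmoid is coded exactly as the equation is printed, and the values of α and β in the usage lines are arbitrary illustrations.

```python
import math
from typing import Sequence


def divergence(p_y: Sequence[float], p_x: Sequence[float]) -> float:
    """Discrete form of D(y||x): sum of (p(y) - p(x)) * log(p(y) / p(x))."""
    return sum((a - b) * math.log(a / b) for a, b in zip(p_y, p_x))


def confidence_index(d: float, alpha: float, beta: float) -> float:
    """Sigmoid transform as printed: 1 / (1 + exp(alpha - (D + beta)))."""
    return 1.0 / (1.0 + math.exp(alpha - (d + beta)))


if __name__ == "__main__":
    p_signal = [0.9, 0.1]  # made-up histogram of the input signal
    p_model = [0.5, 0.5]   # made-up histogram of the system model
    d = divergence(p_signal, p_model)
    print(d, confidence_index(d, alpha=1.0, beta=0.5))
```

Each term of the sum is non-negative, so the divergence is zero only when the two histograms coincide.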

After the three environment parameters $\mathrm{SNR}_y$, $P_y$ and $R_y$ are calculated, signal recognition verifier 220, which has been trained in advance with environment parameters, receives and analyzes the three environment parameters to verify whether the input signal is recognizable, as shown in FIG. 4. The training rule with the environment parameters can be the multi-layer perceptron (MLP) method of pattern classification.
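To illustrate how a trained verifier of this kind might turn the three parameters into a decision, the following Python sketch runs the forward pass of a perceptron with one hidden layer. The network shape, the weights, the 0.5 decision threshold and all names are assumptions for illustration only; the specification does not give the topology or the trained values, and no training procedure is shown.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def mlp_score(
    features: Sequence[float],
    hidden_weights: Sequence[Sequence[float]],
    hidden_biases: Sequence[float],
    output_weights: Sequence[float],
    output_bias: float,
) -> float:
    """Forward pass: features -> sigmoid hidden layer -> sigmoid output in (0, 1)."""
    hidden = [
        sigmoid(sum(w * f for w, f in zip(row, features)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


def is_recognizable(score: float, threshold: float = 0.5) -> bool:
    """Decision rule (assumed): recognizable when the score reaches the threshold."""
    return score >= threshold


if __name__ == "__main__":
    # Untrained, made-up weights: one hidden unit summing (SNR_y, P_y, R_y).
    weights = dict(hidden_weights=[[1.0, 1.0, 1.0]], hidden_biases=[0.0],
                   output_weights=[4.0], output_bias=-2.0)
    print(is_recognizable(mlp_score([5.0, 1.0, 1.0], **weights)))
    print(is_recognizable(mlp_score([-5.0, 0.0, 0.0], **weights)))
```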

As mentioned above, when the input signal is verified as unrecognizable by the signal recognition verifier 220, the strategy response processor 230 is triggered to respond with strategies. There are a plurality of possible strategies. FIG. 5 shows a working example of the strategy. In this working example, the user is informed that the input signal cannot be recognized and is told the current usage environment condition and signal quality, as in step 501, to guide the user to improve the environment condition and the signal quality. For example, when the SNR is lower than a threshold, the user is advised to raise the voice or move to a quieter location. When the confidence index of the system processing the signal is lower than a threshold, the user is advised to speak more clearly. Then, the user is prompted to re-enter the input signal or is transferred to a customer service representative, as in step 502.
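The working example can be sketched as a small rule set in Python. The function name, the message wording and the threshold values are hypothetical; the specification only states that each parameter is compared against a threshold.

```python
from typing import List


def choose_prompts(
    snr: float,
    confidence: float,
    snr_threshold: float,
    confidence_threshold: float,
) -> List[str]:
    """Build the messages for an input that was verified as unrecognizable."""
    # Step 501: report the failure and the advice matching each weak parameter.
    prompts = ["The input could not be recognized."]
    if snr < snr_threshold:
        prompts.append("Please raise your voice or move to a quieter location.")
    if confidence < confidence_threshold:
        prompts.append("Please speak more clearly.")
    # Step 502: ask for a new input or offer a transfer.
    prompts.append("Please repeat your input, or ask for a customer service representative.")
    return prompts


if __name__ == "__main__":
    for line in choose_prompts(snr=1.0, confidence=0.2, snr_threshold=5.0, confidence_threshold=0.5):
        print(line)
```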

In an experiment with 936 clean speech utterances of Chinese names, babble noise at five different SNRs between 0 dB and 20 dB is added to simulate the noise environment and generate six sets of tests, 5616 test signals (6 × 936) in total. With the noise interference, the recognition rate of the six sets of tests is shown in FIG. 6. In a noise-free environment, the recognition rate is 94.2%. When the babble noise is added, the average recognition rate for the six sets of tests is reduced to 64.8%.

The system recognition rate clearly decreases rapidly as the SNR decreases. With the present invention adding the aforementioned detection method in the front stage of the recognition system, the environment parameters are generated for every unrecognizable and recognizable signal. FIG. 7 shows the output of the error rate for the recognizable and unrecognizable signals, respectively.

In FIG. 7, A represents the number of utterances that the recognition device cannot recognize successfully, and B represents the number of utterances that the present invention mistakenly verifies as recognizable. Similarly, C represents the number of utterances that the recognition device can recognize successfully, and D represents the number of utterances that the present invention mistakenly verifies as unrecognizable. The average recognition rate of the recognition device is calculated as the ratio between the number of correctly recognized utterances and the number of utterances entering the recognition device, that is, (C − D)/(C − D + B) = (3640 − 807)/(3640 − 807 + 453) = 86.2%.
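The arithmetic above can be checked with a few lines of Python. The function name is illustrative; the counts are the ones quoted for FIG. 7.

```python
def gated_recognition_rate(c: int, d: int, b: int) -> float:
    """Correctly recognized utterances divided by utterances entering the recognizer."""
    passed_correct = c - d            # recognizable utterances that were let through
    entering = passed_correct + b     # plus unrecognizable ones wrongly let through
    return passed_correct / entering


if __name__ == "__main__":
    rate = gated_recognition_rate(c=3640, d=807, b=453)  # 2833 / 3286
    print(f"{rate:.1%}")  # 86.2%
```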

As seen in the above results, after the detection method of the present invention is added to the front stage of the recognition system, the recognition rate is improved from 64.8% to 86.2%, and the unrecognizable signals are rejected to prevent the further effects of erroneous recognition.

In summary, the present invention provides a system and a method for detecting the recognizability of the input signal. The present invention detects the usage environment conditions and the signal quality in the front stage of the recognition system to verify whether the signal can be recognized successfully. In the present invention, three environment parameters, namely the SNR, the probability of the signal being speech, and the confidence index of the system processing the signal, are used to represent the environment conditions and the signal quality. The environment parameters are used to train the signal recognition verifier to verify whether the signal can be recognized successfully. When the signal is verified as recognizable, the signal is passed to the recognition device for recognition. When the signal is verified as unrecognizable, a strategy response processor is triggered to inform the user of the environment conditions and prompt the user to input a signal of better quality.

Although the present invention has been described with reference to the preferred embodiments, it will be understood that the invention is not limited to the details described thereof. Various substitutions and modifications have been suggested in the foregoing description, and others will occur to those of ordinary skill in the art. Therefore, all such substitutions and modifications are intended to be embraced within the scope of the invention as defined in the appended claims.

1. A system for detecting recognizability of an input signal, applicable to the front stage of a speech recognition device or a dialog device, comprising: an environment parameter generator to generate at least an environment parameter from an input signal; a signal recognition verifier to verify whether said input signal is recognizable in accordance with said environment parameters after initial training with environment parameters in advance; and a strategy response processor; wherein said input signal is passed to said speech recognition or dialog device when said input signal is verified as recognizable, while said strategy response processor is triggered to respond with a plurality of strategies when said input signal is verified as unrecognizable.
2. The system as claimed in claim 1, wherein said environment parameters represent the environment conditions or the signal quality of said input signal.
3. The system as claimed in claim 2, wherein said environment parameters include the signal-to-noise ratio (SNR) of said input signal, a probability of said input signal being a speech, and a confidence index of said system processing said input signal, or any combination of said three environment parameters.
4. The system as claimed in claim 3, wherein said environment parameter generator includes an SNR calculator, a probability calculator, and a confidence calculator for calculating said SNR, said probability of said input signal being a speech, and said confidence index of said system processing said input signal, respectively.
5. The system as claimed in claim 1, wherein said strategy is to inform said user of the environment conditions and signal quality and provide said user with corresponding solutions.
6. The system as claimed in claim 5, wherein said environment conditions and signal quality of said input signal include the SNR of said input signal, a probability of said input signal being a speech, and a confidence index of said system processing said input signal.
7. The system as claimed in claim 5, wherein said corresponding solutions include the ways to improve said environment conditions and signal quality.
8. The system as claimed in claim 7, wherein said ways include raising the user's voice, changing to a quieter location, raising the speech clarity, and giving up the recognition process.
9. The system as claimed in claim 8, wherein said user is prompted to raise the voice, change to a quieter location and re-input the signal when said SNR is lower than a threshold.
10. The system as claimed in claim 8, wherein said user is prompted to raise the speech clarity and re-input the signal when said confidence is lower than a threshold.
11. The system as claimed in claim 8, wherein said giving up recognition process means said input signal is not passed to said recognition or dialog device, or said user is transferred to a customer service representative.
12. A method for detecting the recognizability of an input signal, applicable to the front stage of a speech recognition or dialog device, said method comprising the steps of: (a) generating at least an environment parameter for said input signal, said environment parameter representing the environment conditions and the signal quality of said input signal; (b) using said environment parameter to verify whether said input signal can be recognized successfully after training with environment parameters; and (c) passing said input signal to said speech recognition or dialog device when said input signal is verified as recognizable; otherwise, triggering a strategy response processor to provide a plurality of strategies when said input signal is verified as unrecognizable.
13. The method as claimed in claim 12, wherein said environment parameters in said step (a) include the signal-to-noise ratio (SNR) of said input signal, a probability of said input signal being a speech, and a confidence index of said system processing said input signal, or any combination of the above three environment parameters.
14. The method as claimed in claim 12, wherein said environment parameters are generated by a voice activity detection (VAD) method and a missing feature imputation (MFI) method.
15. The method as claimed in claim 12, wherein the generation of said SNR comprises the following steps of: using a VAD method on the spectrum feature parameter of said input signal to detect a speech part and a non-speech part of said input signal; using an MFI method to eliminate noise from said speech part to obtain a clean speech signal; and calculating said SNR of said input signal in accordance with said non-speech part and said clean speech signal.
16. The method as claimed in claim 12, wherein the generation of said probability comprises the steps of: using an MFI method to calculate the probability that said SNR is greater than 0 when said clean signal spectrum parameter is at a time t and a band d; using an MFI method to calculate a probability R(t) that said clean signal spectrum is a speech at said time t; and calculating the average of said R(t) during the total length of said input signal to obtain said probability of said input signal being a speech.
17. The method as claimed in claim 12, wherein the generation of said confidence comprises the steps of: measuring the divergence between said input signal and a known system model distribution on the frequency spectrum; and using a sigmoid function to transform said divergence into a confidence index between 0 and 1.
18. The method as claimed in claim 12, wherein said training with said environment parameters uses a multi-layer perceptron method of pattern classification.
19. The method as claimed in claim 12, wherein said strategy of said step (c) is to inform said user of the environment conditions and signal quality and provide said user with corresponding solutions.
20. The method as claimed in claim 19, wherein said environment conditions and signal quality of said input signal include the SNR of said input signal, a probability of said input signal being a speech, and a confidence index of said system processing said input signal.
21. The method as claimed in claim 19, wherein said corresponding solutions include the ways to improve said environment conditions and signal quality.
22. The method as claimed in claim 21, wherein said ways include raising the user's voice, changing to a quieter location, raising the speech clarity, and giving up the recognition process.
23. The method as claimed in claim 20, wherein said user is prompted to raise the voice or change to a quieter location and re-input the signal when said SNR is lower than a threshold.
24. The method as claimed in claim 20, wherein said user is prompted to raise the speech clarity and re-input the signal when said confidence is lower than a threshold.
25. The method as claimed in claim 20, wherein said giving up recognition process means said input signal is not passed to said recognition or dialog device, or said user is transferred to a customer service representative.